Credit Card Users Churn Prediction

Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services means lost revenue, so the bank wants to analyse its customer data to identify the customers who are likely to leave and the reasons why, so that it can improve in those areas.

The aim is to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.

Objective

Data Dictionary

Importing the required libraries

Loading and exploring the dataset

There are no duplicate rows in the dataset.

Since the client number column contains only unique values, it will not add any value to our model; let us drop it.
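As a hedged sketch (assuming the client-number column is named `CLIENTNUM`; the toy frame below stands in for the real data), the drop could look like:

```python
import pandas as pd

# Toy frame standing in for the bank data; CLIENTNUM is an assumed
# column name for the client number.
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Customer_Age": [45, 49, 51],
})

# Every client number is unique, so the column carries no predictive signal.
assert df["CLIENTNUM"].is_unique
df = df.drop(columns=["CLIENTNUM"])
```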

Observations:

We can see that the Education_Level and Marital_Status columns have null values.

Attrition_Flag is our dependent variable. It contains two values, "Attrited Customer" and "Existing Customer"; let us encode them as 0 and 1 respectively.
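A minimal sketch of that encoding on a toy column:

```python
import pandas as pd

# Toy target column; the mapping matches the encoding described above
# (Attrited Customer -> 0, Existing Customer -> 1).
df = pd.DataFrame({
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer"],
})
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Attrited Customer": 0, "Existing Customer": 1}
)
```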

Univariate Analysis

Observations:

Observations on Gender

Observations:

Observations on Education_Level

Observations:

Observation on Marital_Status

Observations:

Observation on Income_Category

Observations:

Observations on Card_Category

Observations:

Bivariate Analysis

Observations:

Observations:

Missing Value Treatment

Education_Level and Marital_Status have missing values.

Missing value treatment for Education_Level

Let us replace the missing values with "Unknown".

Missing value treatment for Marital_Status

Let us replace the missing values with "Unknown".

We have now treated the columns containing null values.
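The two fills above can be sketched together on toy data:

```python
import pandas as pd

# Toy rows with missing values in the two affected columns.
df = pd.DataFrame({
    "Education_Level": ["Graduate", None, "High School"],
    "Marital_Status": [None, "Married", "Single"],
})

# Replace missing entries with an explicit "Unknown" category.
for col in ["Education_Level", "Marital_Status"]:
    df[col] = df[col].fillna("Unknown")
```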

Outlier detection and treatment

Calculate the outliers in each column
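The notebook does not show its outlier rule; a common choice, sketched here as an assumption, is the 1.5 × IQR fence:

```python
import pandas as pd

# Toy numeric column with one obvious outlier (95).
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] counts as an outlier.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = int(((s < lower) | (s > upper)).sum())
```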

Observations:

Feature Engineering

Log Transformation

'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Ct_Chng_Q4_Q1' and 'Avg_Utilization_Ratio' are quite skewed. Let us apply log transformations and check whether we can reduce this skewness.
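Each column below gets the same check: compare skewness before and after a log transform. A sketch on toy right-skewed values (standing in for one of the listed columns); `log1p` is used so zeros are handled safely:

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for a column such as Credit_Limit.
s = pd.Series([1000.0, 1500.0, 2000.0, 3000.0, 25000.0, 34000.0])

skew_before = s.skew()
skew_after = np.log1p(s).skew()  # log(1 + x) avoids log(0)
```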

Log Transformation of Credit_limit

There is a reduction in skewness, so let us create a new column with the log values of Credit_Limit.

Log Transformation of Total_Revolving_Bal

The data became left-skewed after applying the log transformation, so let us keep this column as it is.

Log Transformation of Avg_Open_To_Buy

The distribution looks better with the log transformation. Let us add it as a new column.

Log Transformation of Total_Amt_Chng_Q4_Q1

There is not much difference here, so let us leave this column as it is.

Log Transformation of Total_Trans_Amt

The distribution looks better with the log transformation. Let us add it as a new column.

Log Transformation of Total_Ct_Chng_Q4_Q1

There is not much difference here, so let us leave this column as it is.

Log Transformation of Avg_Utilization_Ratio

There is not much difference here, so let us leave this column as it is.

One Hot Encoding

Let us convert the columns with object dtype to categorical dtype.
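A sketch of that conversion on toy columns:

```python
import pandas as pd

# Toy object-dtype columns standing in for the categorical features.
df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})

# Convert every object-dtype column to pandas' category dtype.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")
```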

Look at the unique values in the columns having a categorical datatype.

In Income_Category there is a stray value "abc". Let us replace it with "Unknown".
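A sketch of the replacement:

```python
import pandas as pd

# Toy Income_Category column containing the stray "abc" value.
df = pd.DataFrame({
    "Income_Category": ["$60K - $80K", "abc", "Less than $40K"],
})
df["Income_Category"] = df["Income_Category"].replace("abc", "Unknown")
```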

Calculate the class ratio of the outcome variable (Attrition_Flag).

We can see that attrited customers make up about 16% of the data and existing customers about 84%. The outcome variable is imbalanced.
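The ratio check can be sketched on a toy target built with the reported ~16:84 imbalance:

```python
import pandas as pd

# Toy encoded target (0 = attrited, 1 = existing) with a 16:84 split,
# mirroring the imbalance reported above.
y = pd.Series([0] * 16 + [1] * 84)
ratio = y.value_counts(normalize=True)
```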

Split the data

The data is split into train, validation and test sets in the ratio 60:20:20.
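One way to get a 60:20:20 split is two calls to `train_test_split`: first carve off 40%, then halve it. A sketch on synthetic stand-in data, stratified so each set keeps the class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the features and the imbalanced target.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 84)

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=1, stratify=y_temp
)
```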

Let us build functions to create the confusion matrix for the validation and test sets and to calculate the metrics: accuracy, precision, recall and F1 score.
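A minimal version of such a helper (the function name and return shape are illustrative, not the notebook's):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def evaluate(y_true, y_pred):
    """Return the confusion matrix and the four metrics of interest."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

metrics = evaluate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```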

Model Building

Metric of Interest

What does the bank want?

To minimise its losses. There are two types of loss here: losing a customer the model failed to flag as likely to churn, and spending retention effort on a customer the model wrongly flagged as likely to churn.

Which loss is greater?

The loss from a customer leaving the credit card services is greater.

Since we do not want to miss customers who are about to stop using their credit card, we will use recall as the scoring metric.

Logistic Regression
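Every model section below follows the same fit-then-score pattern. A sketch with logistic regression on synthetic imbalanced data (`make_classification` is a stand-in for the prepared bank features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features.
X, y = make_classification(n_samples=300, weights=[0.16], random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_recall = recall_score(y_val, model.predict(X_val))
```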

The confusion matrix

Decision Tree

The confusion matrix

Bagging Classifier

The confusion matrix

Random Forest Classifier

The confusion matrix

AdaBoost Classifier

The confusion matrix

GradientBoosting Classifier

The confusion matrix

Oversampling data

Oversampling train data using SMOTE

Now let us train the models using oversampled data

Logistic Regression on oversampled data

The confusion matrix

Decision tree on oversampled data

The confusion matrix

Bagging Classifier on oversampled data

The confusion matrix

Random Forest Classifier on oversampled data

The confusion matrix

AdaBoost Classifier on oversampled data

The confusion matrix

Gradient Boosting Classifier on oversampled data

The confusion matrix

Undersampling data

Let us undersample the data using RandomUnderSampler.

Now let us train the models using undersampled data

Logistic Regression on undersampled data

The confusion matrix

Decision tree on undersampled data

The confusion matrix

Bagging Classifier on undersampled data

The confusion matrix

Random Forest Classifier on undersampled data

The confusion matrix

AdaBoost Classifier on undersampled data

The confusion matrix

Gradient Boosting Classifier on undersampled data

The confusion matrix

Comparing the models

Observations:

Logistic Regression:

Decision Tree:

Bagging Classifier:

Random Forest Classifier:

AdaBoost Classifier:

Gradient Boosting Classifier:

Let us perform hyperparameter tuning on the AdaBoost Classifier (original dataset), the Gradient Boosting Classifier (original dataset) and the Gradient Boosting Classifier (oversampled data), since they perform best among all the models, and check whether performance improves.

Tuning using AdaBoost Classifier on Original train dataset

Let us check the best cv score
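A sketch of such a search on synthetic stand-in data; the parameter grid is an assumption, not the notebook's actual search space, and recall is the scoring metric chosen earlier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the param_grid values are illustrative.
X, y = make_classification(n_samples=200, weights=[0.2], random_state=1)

grid = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100], "learning_rate": [0.1, 1.0]},
    scoring="recall",
    cv=3,
).fit(X, y)

best_cv_recall = grid.best_score_
```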

The confusion matrix

Recall increased by 1% after tuning.

Tuning using Gradient Boosting Classifier on Original train dataset

Let us check the best cv score

The confusion matrix

There is no change in recall compared to the untuned model.

Tuning using Gradient Boosting Classifier on Oversampled train dataset

Let us check the best cv score

The confusion matrix

The recall on the validation set has dropped, which suggests the model is overfitting the data.

Comparing the tuned models

Observations:

Predict the Performance on test set

The recall on the test set is the same as on the validation set, so the model is not overfitting the data.

Pipeline

Let us create a pipeline which does the following:

The score is almost the same for the train, validation and test sets, which indicates that our model is not overfitting the data.
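Under assumed column names, a pipeline of that shape (impute, one-hot encode, then fit the tuned model) might be sketched as:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame; column names are assumptions based on the data dictionary.
X = pd.DataFrame({
    "Credit_Limit": [3000.0, 12000.0, 5000.0, 8000.0],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
})
y = [0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Credit_Limit"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Education_Level"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", AdaBoostClassifier(random_state=1)),
]).fit(X, y)

preds = pipe.predict(X)
```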

Insights:

Logistic Regression:

Decision Tree:

Bagging Classifier:

Random Forest Classifier:

AdaBoost Classifier:

Gradient Boosting Classifier:

After performing hyperparameter tuning on the AdaBoost Classifier (original dataset), the Gradient Boosting Classifier (original dataset) and the Gradient Boosting Classifier (oversampled data):

From the final pipeline we got a model score of 98% on all three sets: train, validation and test.

Recommendations